We investigate a dataset of 102 bulk RNA-seq samples extracted from the prefrontal cortex of 102 deceased individuals. The dataset is provided by H Rehrauer of FGCZ (details). Roughly half of the samples are from patients diagnosed with ALS, the remaining ones are referred to as the control group. We attempt to infer cell type proportions hidden in each bulk RNA-seq sample. This procedure, known as cell type deconvolution (overview), should be integral to inference from the sample to disease condition. We focus here on reference-based deconvolution: in essence, if \(X\) is a bulk RNA-seq count vector, and \(a, b, \ldots\) are single-cell RNA-seq count vectors of representatives of relevant cells types, one aims to find non-negative coefficients \(\alpha, \beta, \ldots\) such that \(X \approx \alpha a + \beta b + \ldots\). Unfortunately, this reconstruction is confounded by batch effects such as difference in sequencing platforms, in particular different statistical properties of the technical noise of bulk versus single-cell RNA-seq.
As for cell type reference datasets we resort to
the single-cell RNA-seq dataset Darmanis et al (~300 labeled cells used), and to
the single-nucleus RNA-seq dataset Allen Brain M1 (>70k labeled cells).
The two datasets are compared here.
We apply the following packages/algorithms:
Bisque (R) prior to estimating cell type proportions transforms the bulk gene expression gene-wise to match the distribution from the reference dataset. It expects the reference single cells to come from several individuals. Links: paper, vignette, manual, github, ReferenceBasedDecomposition (uses limSolve).
MuSiC (R) “up-weighs genes with low cross-subject variance” and vice versa, and recursively estimates the proportion of clusters of similar cell types using genes of low within-cluster variance. Links: paper, tutorial, github, music_prop.
NNLS is a barebones Python implementation of cell type proportion reconstruction using non-negative least squares. Links: code.
RNA-Sieve (Python/Mathematica) incorporates the dependence of the observation noise on cell type composition of the bulk sample into the maximum likelihood estimator, and can also provide confidence intervals. Links: preprint, github, code.
DWLS (R) identifies differentially expressed genes as markers and averages those for each cell type. The absolute square residual in the usual NLLS is replaced by a relative one, requiring an iterative procedure to find a minimizer. Links: paper, github, bitbucket, quickstart, manual.
We were unable to run SCDC.
This script:
suppressPackageStartupMessages({
# install.packages("kableExtra")
requireNamespace("kableExtra")
# install.packages("dplyr")
library(dplyr)
# install.packages("ggplot2")
library(ggplot2)
# install.packages("pheatmap")
requireNamespace("pheatmap")
# install.packages("stringr")
requireNamespace("stringr")
# install.packages("reshape2")
requireNamespace("reshape2")
# install.packages("pbapply")
requireNamespace("pbapply")
# install.packages("pathlibr")
requireNamespace("pathlibr")
# avoid importing `exprs` that leads to clashes
requireNamespace("rlang")
# https://vroom.r-lib.org/articles/vroom.html
# install.packages("vroom")
requireNamespace("vroom")
# Provides `ExpressionSet` data structure
# install.packages("BiocManager")
# BiocManager::install("Biobase")
requireNamespace("Biobase")
#
# Deconvolution packages follow
#
# install.packages("BisqueRNA")
requireNamespace("BisqueRNA")
# remotes::install_github("renozao/xbioc@1354168")
requireNamespace("xbioc") # For MuSiC and such
# https://xuranw.github.io/MuSiC/articles/MuSiC.html
# remotes::install_github("xuranw/MuSiC@01e51ba")
requireNamespace("MuSiC")
# We need to expose some functions from those packages for DWLS:
# BiocManager::install("ROCR")
# BiocManager::install("MAST")
requireNamespace("ROCR")
requireNamespace("MAST")
# If installed afresh, this takes ages and then some:
# remotes::install_bitbucket("yuanlab/dwls@f13dcf9")
requireNamespace("DWLS")
})
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=lzh_TW
## [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=lzh_TW LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=lzh_TW LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=lzh_TW LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.3 dplyr_1.0.2
##
## loaded via a namespace (and not attached):
## [1] backports_1.2.1 plyr_1.8.6
## [3] igraph_1.2.6 lazyeval_0.2.2
## [5] splines_3.6.3 BiocParallel_1.20.1
## [7] listenv_0.8.0 scattermore_0.7
## [9] GenomeInfoDb_1.22.1 digest_0.6.27
## [11] htmltools_0.5.0 magrittr_2.0.1
## [13] checkmate_2.0.0 memoise_1.1.0
## [15] tensor_1.5 cluster_2.1.0
## [17] ROCR_1.0-11 globals_0.14.0
## [19] matrixStats_0.57.0 vroom_1.3.2
## [21] MCMCpack_1.4-9 BisqueRNA_1.0.4
## [23] colorspace_2.0-0 blob_1.2.1
## [25] rvest_0.3.6 ggrepel_0.9.0
## [27] xfun_0.20 crayon_1.3.4
## [29] RCurl_1.98-1.2 jsonlite_1.7.2
## [31] spatstat.data_1.7-0 spatstat_1.64-1
## [33] MuSiC_0.1.1 survival_3.1-8
## [35] zoo_1.8-8 glue_1.4.2
## [37] polyclip_1.10-0 kableExtra_1.3.1
## [39] registry_0.5-1 gtable_0.3.0
## [41] nnls_1.4 zlibbioc_1.32.0
## [43] XVector_0.26.0 webshot_0.5.2
## [45] MatrixModels_0.4-1 leiden_0.3.6
## [47] DelayedArray_0.12.3 future.apply_1.7.0
## [49] SingleCellExperiment_1.8.0 BiocGenerics_0.32.0
## [51] abind_1.4-5 SparseM_1.78
## [53] scales_1.1.1 pheatmap_1.0.12
## [55] DBI_1.1.0 miniUI_0.1.1.1
## [57] Rcpp_1.0.5 DWLS_0.1
## [59] viridisLite_0.3.0 xtable_1.8-4
## [61] reticulate_1.18 rsvd_1.0.3
## [63] bit_4.0.4 stats4_3.6.3
## [65] htmlwidgets_1.5.3 httr_1.4.2
## [67] RColorBrewer_1.1-2 ellipsis_0.3.1
## [69] Seurat_3.2.3 ica_1.0-2
## [71] reshape_0.8.8 pkgconfig_2.0.3
## [73] uwot_0.1.10 deldir_0.2-3
## [75] tidyselect_1.1.0 rlang_0.4.10
## [77] reshape2_1.4.4 later_1.1.0.1
## [79] AnnotationDbi_1.48.0 munsell_0.5.0
## [81] tools_3.6.3 generics_0.1.0
## [83] RSQLite_2.2.1 xbioc_0.1.19
## [85] ggridges_0.5.3 evaluate_0.14
## [87] stringr_1.4.0 fastmap_1.0.1
## [89] goftest_1.2-2 yaml_2.2.1
## [91] mcmc_0.9-7 knitr_1.30
## [93] bit64_4.0.5 fitdistrplus_1.1-3
## [95] purrr_0.3.4 RANN_2.6.1
## [97] nlme_3.1-144 pbapply_1.4-3
## [99] future_1.21.0 mime_0.9
## [101] quantreg_5.75 xml2_1.3.2
## [103] compiler_3.6.3 rstudioapi_0.13
## [105] png_0.1-7 plotly_4.9.2.2
## [107] e1071_1.7-4 spatstat.utils_1.20-2
## [109] tibble_3.0.4 stringi_1.5.3
## [111] lattice_0.20-40 Matrix_1.2-18
## [113] vctrs_0.3.6 pillar_1.4.7
## [115] lifecycle_0.2.0 BiocManager_1.30.10
## [117] lmtest_0.9-38 RcppAnnoy_0.0.18
## [119] data.table_1.13.6 cowplot_1.1.1
## [121] bitops_1.0-6 irlba_2.3.3
## [123] conquer_1.0.2 httpuv_1.5.4
## [125] patchwork_1.1.1 GenomicRanges_1.38.0
## [127] R6_2.5.0 promises_1.1.1
## [129] gridExtra_2.3 KernSmooth_2.23-16
## [131] IRanges_2.20.2 parallelly_1.23.0
## [133] codetools_0.2-16 MASS_7.3-51.5
## [135] assertthat_0.2.1 SummarizedExperiment_1.16.1
## [137] MAST_1.12.0 pkgmaker_0.32.2.900
## [139] withr_2.3.0 sctransform_0.3.2
## [141] S4Vectors_0.24.4 GenomeInfoDbData_1.2.2
## [143] mgcv_1.8-31 parallel_3.6.3
## [145] rpart_4.1-15 quadprog_1.5-8
## [147] grid_3.6.3 pathlibr_0.1.0
## [149] tidyr_1.1.2 coda_0.19-4
## [151] class_7.3-15 rmarkdown_2.6
## [153] Rtsne_0.15 Biobase_2.46.0
## [155] shiny_1.5.0
BASEPATH <- pathlibr::Path$new(".")
stopifnot("deconvolution.Rmd" %in% names(BASEPATH$dir))
out_file <- (function(.) BASEPATH$join("output")$join(.)$show)
dir.create(out_file(""), showWarnings = FALSE)
path_to <- (function(.) Sys.glob(BASEPATH$join("../../..")$join(.)$show))
set.seed(43)
# Use first column as index
by_col1 <- (function(.) tibble::column_to_rownames(., colnames(.)[1]))
# Use index as new column `name`
ind2col <- (function(., name) tibble::rownames_to_column(., var = name))
group_by_and_sum <-
function(., column_name) {
(.) %>%
dplyr::group_by(!!rlang::sym(column_name)) %>%
dplyr::summarise_all(sum) %>%
tibble::column_to_rownames(column_name)
}
fix_names <- function(.) {
(.) %>%
# https://stackoverflow.com/a/55184433
rlang::set_names(stringr::str_replace(names(.), "cell type", "celltype"))
}
col_norm2 <- (function(.) t(t(.) / sqrt(colSums((.)**2))))
The function utils::read.delim is too slow to read wide tables (Allen Brain M1 has >70k cells).
Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 32)
from_csv <- function(.) {
vroom::vroom(., del = "\t") %>%
by_col1() %>%
suppressMessages()
}
A call to to_csv should be preceded by ind2col("name for index") to write the row names.
to_csv <- function(., f) {
vroom::vroom_write(., out_file(f), delim = "\t")
}
hush <- base::invisible
ggplot2::theme_set(theme_light(base_size = 15))
kable <- function(.) {
kableExtra::kbl(., align = "c") %>%
kableExtra::kable_paper("hover", full_width = F)
}
# Wide plots (inches?)
WIDTH1 <- 20
HEIGHT1 <- 3
hist_theme <-
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.ticks.x = element_blank(),
)
save_barplot <-
function(., filename) {
as.data.frame(.) %>%
ind2col("celltype") %>%
reshape2::melt(
id = "celltype",
var = "sample",
value.name = "y"
) %>%
{
ggplot(., aes(x = sample, y = y, fill = celltype)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Paired") +
ylim(0, 1) +
hist_theme +
theme(
legend.title = element_blank(),
axis.text.y = element_blank(),
axis.title.x = element_blank(),
axis.text.x = element_text(
a = -90,
vjust = 0.5,
hjust = 0,
size = 10,
)
) +
ggsave(
filename = out_file(filename),
width = WIDTH1,
height = HEIGHT1,
device = "png"
)
} %>%
suppressWarnings()
return(.)
}
save_heatmap <-
function(., filename) {
as.data.frame(.) %>%
pheatmap::pheatmap(
filename = out_file(filename),
cluster_rows = FALSE,
cluster_cols = FALSE,
fontsize_row = 14,
fontsize_col = 10,
width = WIDTH1,
height = HEIGHT1
)
return(.)
}
fgcz_data <-
path_to("data/20201128-FGCZ/*count.zip") %>%
unz(., unzip(., list = TRUE)$Name) %>%
from_csv() %>%
# Collapse ENSG IDs by gene_name:
group_by_and_sum("gene_name")
fgcz_meta <-
from_csv(path_to("data/20201128-FGCZ/*infos.tsv"))
This will be used later.
assert_fgcz_meta_order <-
function(.) {
stopifnot(all(rownames(.) == rownames(fgcz_meta)))
return(.)
}
t(fgcz_data) %>%
assert_fgcz_meta_order() %>%
hush()
darm_data <- from_csv(path_to("data/*/2015-Darmanis/b*/data.csv.gz"))
darm_meta <- from_csv(path_to("data/*/2015-Darmanis/b*/meta.csv.gz")) %>%
fix_names()
exclude_celltypes <- c("fetal_quiescent", "fetal_replicating", "hybrid")
Plot cell type counts.
darm_meta$celltype %>%
data.frame(x = .) %>%
ggplot(aes(x = x, fill = if_else(x %in% exclude_celltypes, "drop", "keep"))) +
geom_bar() +
ggtitle("Cell types in the 'Darmanis' reference dataset") +
scale_y_log10() +
hist_theme +
theme(axis.title.x = element_blank()) +
theme(axis.text.x = element_text(a = 45, hjust = 1)) +
theme(legend.title = element_blank())
Remove unnecessary cell types from the scRNA reference dataset.
darm_meta <-
darm_meta %>%
dplyr::filter(!(celltype %in% exclude_celltypes))
darm_data <-
darm_data %>%
dplyr::select_if(names(.) %in% rownames(darm_meta))
stopifnot(285 == nrow(darm_meta))
stopifnot(285 == ncol(darm_data))
abm1_meta <- from_csv(path_to("data/*/2019-AllenBrain-M1/b*/meta.csv*"))
abm1_data <- from_csv(path_to("data/*/2019-AllenBrain-M1/b*/data.csv*"))
Some sanity checks.
# Number of single cells
stopifnot(ncol(abm1_data) > 70000)
stopifnot(nrow(abm1_meta) > 70000)
# Number of genes and some examples
stopifnot(nrow(abm1_data) == 141)
stopifnot(all(c("PIK3CD", "WNT4", "LDLRAP1") %in% rownames(abm1_data)))
Drop samples with too little expression, otherwise Bisque doesn’t go through.
abm1_data <-
abm1_data %>%
# Keep only genes common with the FGCZ dataset
{
(.)[rownames(.) %in% rownames(fgcz_data), ]
} %>%
# Drop zero columns (dplyr variant is too slow here)
{
(.)[, colSums(.) != 0]
}
abm1_meta <-
abm1_meta %>%
dplyr::filter(rownames(.) %in% colnames(abm1_data))
Plot cell type counts.
data.frame(
x = abm1_meta$celltype,
donor = abm1_meta$donor
) %>%
ggplot(aes(x = x, fill = donor)) +
geom_bar(position = "dodge") +
ggtitle("Cell types in the 'Allen Brain M1' reference dataset") +
scale_fill_brewer(palette = "Set1") +
scale_y_log10() +
hist_theme +
theme(axis.title.x = element_blank()) +
theme(axis.title.y = element_blank())
These marker genes were extracted from the Allen Brain M1 dataset. Technically, these should be the same as in abm1_data before filtering. Just in case, we subset them the maximal common subset.
abm1_markergenes <-
path_to("data/*/*AllenBrain-M1/b*/marker_genes.csv") %>%
from_csv() %>%
rownames() %>%
{
(.)[(.) %in% rownames(fgcz_data)]
} %>%
{
(.)[(.) %in% rownames(abm1_data)]
}
stopifnot(117 == length(abm1_markergenes))
Some cross-sections of the bulk dataset can be found here.
Of particular interest for the following are RIN and post-mortem delay. They exhibit a dependence on the source but appear to be uncorrelated.
fgcz_meta %>%
ggplot(aes(x = pmDelay, y = RIN, shape = Condition, color = Source)) +
scale_x_log10() +
labs(
x = "Post-mortem delay (hours)",
y = "RIN",
color = "Sample source",
shape = "Condition"
) +
geom_point(size = 5, alpha = 0.8) +
scale_color_brewer(palette = "Set1")
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 4 rows containing missing values (geom_point).
When we order the samples by post-mortem delay or RIN, we can see short runs of samples where the condition ALS/Control does not change. Thus if we introduce a quality cut-off (e.g. low RIN), we should take care not to bias the number of ALS/Control samples. For example, dropping 14 samples of lowest RIN could be a good choice.
fgcz_meta %>%
{
lapply(
c("RIN", "pmDelay"),
function(field) {
(.) %>%
mutate(value = !!rlang::sym(field)) %>%
arrange(value) %>%
mutate(rank = (1:nrow(.))) %>%
mutate(cdf_als = cumsum(Condition == "ALS") - (rank / 2)) %>%
mutate(ranked_by = field) %>%
mutate(median_value = median(value))
}
)
} %>%
bind_rows() %>%
ggplot(aes(x = rank, y = cdf_als, shape = Condition, color = Source)) +
labs(x = "Sample number", y = "(Cumulative number of ALS cases) - Expected") +
facet_wrap(~ranked_by, nrow = 1) +
geom_point(size = 2, alpha = 0.7) +
scale_color_brewer(palette = "Set1") +
scale_x_continuous(breaks = seq(0, 100, by = 10))
In this section we use Darmanis et al as the reference dataset.
Please select one of the methods:
First, repackage the data as ExpressionSet following the Bisque vignette. This will also be used in other methods.
fgcz_eset <- Biobase::ExpressionSet(
# Subset to marker genes for residual norm computation
assayData = as.matrix(fgcz_data[abm1_markergenes, ])
)
darm_eset <- Biobase::ExpressionSet(
# Expression data
assayData = as.matrix(darm_data),
# Metadata
phenoData = Biobase::AnnotatedDataFrame(
data = data.frame(
row.names = rownames(darm_meta),
cellType = darm_meta$celltype,
SubjectName = darm_meta$experiment_sample_name,
check.names = FALSE,
check.rows = FALSE,
stringsAsFactors = FALSE
),
varMetadata = data.frame(
row.names = c("cellType", "SubjectName"),
labelDescription = c("cellType", "SubjectName")
)
)
)
Deconvolution.
bisque_report <-
BisqueRNA::ReferenceBasedDecomposition(
# BULK DATA
bulk.eset = fgcz_eset,
# REFERENCE
sc.eset = darm_eset,
#
use.overlap = FALSE,
markers = abm1_markergenes,
verbose = FALSE
)
# Fields of the Bisque result
bisque_report %>%
summary() %>%
kable()
| Length | Class | Mode | |
|---|---|---|---|
| bulk.props | 612 | -none- | numeric |
| sc.props | 48 | -none- | numeric |
| rnorm | 102 | -none- | numeric |
| genes.used | 109 | -none- | character |
| transformed.bulk | 11118 | -none- | numeric |
Bisque returns proportions that sum to one:
stopifnot(max(abs(1 - colSums(bisque_report$bulk.props))) <= 1e-10)
fgcz_darm_bisque <- bisque_report$bulk.props
Cluster bulk samples by composition.
samples_order <-
fgcz_darm_bisque %>%
{
(.)[, hclust(dist(t(.)))$order]
} %>%
colnames()
Save to disk and visualize inferred cell type composition by bulk sample.
save_heat_bars_data <-
function(., prefix = deparse(substitute(.))) {
sorted.df <- (.)[order(tolower(rownames(.))), samples_order]
filenames <-
(function(ext) paste(prefix, ext, sep = "_")) %>% {
list(
data = (.)("data.csv"),
heat = (.)("heat.png"),
bars = (.)("bars.png"),
path = out_file("")
)
}
# Attach the clinical Condition to sample IDs
sorted.df.tagged <-
sorted.df %>%
t() %>%
as.data.frame() %>%
ind2col("Name") %>%
mutate(suffix = if_else(fgcz_meta[Name, "Condition"] == "ALS", " (ALS)", "")) %>%
mutate(Name = paste0(Name, suffix)) %>%
select(-suffix) %>%
by_col1() %>%
t()
# Now plot as heatmap and as barplot
sorted.df.tagged %>%
save_heatmap(filenames$heat) %>%
save_barplot(filenames$bars)
# Save table to disk
sorted.df %>%
as.data.frame() %>%
ind2col("celltype") %>%
to_csv(filenames$data)
return(filenames)
}
save_heat_bars_data(fgcz_darm_bisque) %>% hush()
The cell type proportions returned by Bisque vary rather more than expected. The samples come from four different hospitals (the field fgcz_meta$Source), which could introduce the strongest batch effect. We can also compare that to the RIN (RNA integrity number). Age could be another important factor.
PC1_and_friends <-
function(.) {
plots <-
prcomp(t(.))$x %>%
as.data.frame() %>%
assert_fgcz_meta_order() %>%
cbind(fgcz_meta) %>%
cbind(y = (2 * rank((.)$PC1) / length((.)$PC1) - 1)) %>%
{
# Make several plots silently
lapply(
c("PC2", "RIN", "Age"),
function(field) {
cbind((.), x = (.)[[field]]) %>% {
ggplot(., aes(x = x, y = y, shape = Condition, color = Source)) +
labs(
x = field,
y = "Deviation from the average composition (~PC1)",
color = "Sample source",
shape = "Condition"
) +
geom_point(size = 5, alpha = 0.2 + abs((.)$y)) +
scale_color_brewer(palette = "Set1")
}
}
)
}
t <- theme(
legend.position = "none",
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
)
filename <- out_file(paste(
deparse(substitute(.)),
paste(lapply(plots, function(plot) plot$labels$x), collapse = "_"),
"vs__PC1.png",
sep = "__"
))
plots <- gridExtra::arrangeGrob(
grobs = c(
# Remove redundant stuff
lapply(plots, function(.) (. + t)),
# but keep one legend
list(cowplot::get_legend(plots[[1]]))
),
# All grobs in one row
nrow = 1,
# Y-axis label
left = grid::grid.text(
plots[[1]]$labels$y,
rot = 90,
draw = FALSE,
gp = grid::gpar(fontsize = ggplot2::theme_get()$text$size)
)
)
ggsave(
filename = filename,
plot = plots,
width = WIDTH1,
height = WIDTH1 / (length(plots) - 1)
)
return(filename)
}
Principal components of cell type proportions, etc.
PC1_and_friends(fgcz_darm_bisque) %>% hush()
# Prevent errors from MuSiC
exprs <- xbioc::exprs
pVar <- xbioc::pVar
music_report.darm <-
MuSiC::music_prop(
bulk.eset = fgcz_eset,
sc.eset = darm_eset,
clusters = darm_meta$celltype,
samples = names(darm_data)
) %>%
suppressMessages()
music_report.darm %>%
summary() %>%
kable()
| Length | Class | Mode | |
|---|---|---|---|
| Est.prop.weighted | 612 | -none- | numeric |
| Est.prop.allgene | 612 | -none- | numeric |
| Weight.gene | 11118 | -none- | numeric |
| r.squared.full | 102 | -none- | numeric |
| Var.prop | 612 | -none- | numeric |
Estimate cell type proportions (celltype x sample).
fgcz_darm_music <-
t(music_report.darm$Est.prop.weighted) %>% {
(.)[rownames(fgcz_darm_bisque), ]
}
save_heat_bars_data(fgcz_darm_music) %>% hush()
PC1_and_friends(fgcz_darm_music) %>% hush()
We would like to know whether there is the cell type composition differs in ALS vs Control samples. First, the following figure shows the distribution of the first principal component (of cell type proportions) separated by sample source.
fgcz_darm_music.pca <-
fgcz_darm_music %>%
{
as.data.frame(prcomp(t(.))$x)
} %>%
assert_fgcz_meta_order() %>%
cbind(fgcz_meta)
fgcz_darm_music.pca %>% {
ggplot(., aes(x = PC1, fill = Source, linetype = Source)) +
labs(fill = "Sample source") +
guides(linetype = FALSE) +
geom_density(
position = "identity", alpha = 0.4
) +
geom_histogram(
aes(y = stat(density) / 5),
position = "dodge", bins = 20, color = 1, linetype = 1
) +
hist_theme +
theme(axis.text.y = element_blank()) +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
) +
scale_fill_brewer(palette = "Set1")
}
The separation by Source seen in the above figure suggests looking at the Condition (ALS vs Control) for each Source individually.
fgcz_darm_music.pca %>%
{
ggplot(., aes(x = PC1, fill = Condition)) +
geom_density(
position = "identity", alpha = 0.4
) +
geom_histogram(
aes(y = stat(density) / 5),
position = "dodge", bins = 20, color = 1, linetype = 1
) +
hist_theme +
theme(
axis.text.x = element_blank(),
axis.text.y = element_blank()
) +
scale_fill_brewer(palette = "Set1") +
facet_grid(cols = vars(Source))
} %>%
ggsave(
filename = out_file("fgcz_darm_music_PC1.png"),
width = WIDTH1,
height = WIDTH1 / 4
) %>%
hush()
Among those, samples from Lo show a conspicuous further separation by Condition, supported by the following two-sided two-sample t-test.
fgcz_darm_music.pca %>%
filter(Source != "PD") %>%
group_by(Source) %>%
group_map(~ {
t.test(
x = filter((.x), Condition == "ALS")$PC1,
y = filter((.x), Condition != "ALS")$PC1,
alternative = "two"
) %>% {
list(
`Sample source` = dplyr::pull((.y)),
`p-value` = round((.)$p.value, 3),
`t-statistic` = round((.)$statistic, 2),
`#dof` = round((.)$parameter, 1)
)
}
}) %>%
data.table::rbindlist() %>%
kable()
| Sample source | p-value | t-statistic | #dof |
|---|---|---|---|
| Lo | 0.023 | -2.48 | 18.2 |
| NB | 0.672 | -0.43 | 17.0 |
| Ox | 0.616 | -0.51 | 21.6 |
Ordering the samples by the first principal component reveals its meaning as a measure of the content of astrocytes vs neurons. The samples from Lo suggest that high proportion of neurons is associated with ALS.
fgcz_darm_music %>%
t() %>%
as.data.frame() %>%
assert_fgcz_meta_order() %>%
cbind(Source = fgcz_meta$Source) %>%
cbind(PC1 = fgcz_darm_music.pca$PC1) %>%
tibble::rownames_to_column("ID") %>%
mutate(ID = paste(ID, if_else(fgcz_meta$Condition == "ALS", "(ALS)", ""))) %>%
arrange(PC1) %>%
select(-PC1) %>%
group_by(Source) %>%
group_map(~ {
(.x) %>%
tibble::column_to_rownames("ID") %>%
t() %>%
save_barplot(paste("fgcz_darm_music_", pull(.y), "_bars.png", sep = ""))
}) %>%
hush()
Lo:
NB:
Ox:
PD:
Let’s have a look again at the quality metrics, this time annotated by the estimated fraction of astrocytes.
fgcz_darm_music %>%
t() %>%
as.data.frame() %>%
assert_fgcz_meta_order() %>%
cbind(fgcz_meta) %>%
ggplot(aes(x = pmDelay, y = RIN, shape = Condition, color = Source, alpha = 100 * astrocytes)) +
scale_x_log10() +
labs(
x = "Post-mortem delay (hours)",
y = "RIN",
color = "Sample source",
shape = "Condition",
alpha = "Astrocytes, %"
) +
geom_point(size = 5) +
scale_color_brewer(palette = "Set1")
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 4 rows containing missing values (geom_point).
A trend towards a higher content of (intact) astrocytes at lower RIN is now visible:
fgcz_darm_music %>%
t() %>%
as.data.frame() %>%
assert_fgcz_meta_order() %>%
cbind(fgcz_meta) %>%
ggplot(aes(x = RIN, y = 100 * astrocytes, shape = Condition, color = Source)) +
scale_x_log10() +
labs(
y = "Estimated astrocyte content, %",
color = "Sample source",
shape = "Condition"
) +
geom_point(size = 5, alpha = 0.8) +
scale_color_brewer(palette = "Set1")
## Warning: Transformation introduced infinite values in continuous x-axis
For the code and additional analysis see here.
fgcz_darm_nnls <-
path_to("code/*/*-NNLS_Darmanis/a_*/celltypes.csv") %>%
# The first column is not unique
# Keep header as is, e.g. "H-0C083"
utils::read.delim(., check.names = F) %>%
# `cell type` ~> `celltype`
fix_names() %>%
# Collapse cell types:
group_by_and_sum("celltype")
save_heat_bars_data(fgcz_darm_nnls) %>% hush()
PC1_and_friends(fgcz_darm_nnls) %>% hush()
These results are similar enough to MuSiC’s to warrant a closer look. For each sample we compute the cosine similarity between the two methods and plot a histogram of those separated by sample source.
similarity_plot <- function(X1, X2) {
X1 <- X1[order(rownames(X1)), order(colnames(X1))]
X2 <- X2[order(rownames(X2)), order(colnames(X2))]
# Check that samples are arranged consistently
stopifnot(all(colnames(X1) == colnames(X2)))
stopifnot(all(rownames(X1) == rownames(X2)))
colSums(col_norm2(X1) * col_norm2(X2)) %>%
data.frame(
x = .,
s = fgcz_meta$Source
) %>%
{
ggplot(., aes(x = x, fill = s, linetype = s)) +
labs(fill = "Sample source") +
guides(linetype = FALSE) +
geom_density(
position = "identity", alpha = 0.4
) +
geom_histogram(
aes(y = stat(density) / 5),
position = "dodge", bins = 30, color = 1, linetype = 1,
) +
hist_theme +
theme(axis.text.y = element_blank()) +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
) +
scale_fill_brewer(palette = "Set1")
}
}
similarity_plot(fgcz_darm_music, fgcz_darm_nnls) +
labs(x = "Cosine similarity of a sample (MuSiC vs NNLS)")
For the code see here.
fgcz_darm_sieve <-
path_to("code/*/*-RNASieve/a_*/celltypes.csv") %>%
from_csv() %>%
fix_names()
save_heat_bars_data(fgcz_darm_sieve) %>% hush()
PC1_and_friends(fgcz_darm_sieve) %>% hush()
similarity_plot(fgcz_darm_music, fgcz_darm_sieve) +
labs(x = "Cosine similarity of a sample (MuSiC vs RNA-Sieve)")
Preprocesses the reference dataset.
# This takes some 20min
prediction <- ROCR::prediction
performance <- ROCR::performance
zlm <- MAST::zlm
dwls_darm_signature <-
DWLS::buildSignatureMatrixMAST(
scdata = darm_data,
id = setNames(as.list(darm_meta[["celltype"]]), rownames(darm_meta)),
path = out_file("dwls_darm_signature/")
)
Apply DWLS deconvolution to each bulk sample.
fgcz_darm_dwls <-
fgcz_data %>%
{
genes <- intersect(rownames(dwls_darm_signature), rownames(.))
pbapply::pblapply(
(.)[genes, ],
function(sample) {
# Silence silly printout
purrr::quietly(DWLS::solveDampenedWLS)(
as.matrix(dwls_darm_signature[genes, ]), as.matrix(sample)
)$result
},
# Parallel processes -- requires (cl x RAM)
cl = 1
)
} %>%
as.data.frame(check.names = F)
fgcz_darm_dwls <-
fgcz_darm_dwls %>% {
fgcz_darm_dwls[order(tolower(rownames(.))), ]
}
fgcz_darm_dwls %>%
save_heat_bars_data("fgcz_darm_dwls") %>%
hush()
similarity_plot(fgcz_darm_music, fgcz_darm_dwls) +
labs(x = "Cosine similarity of a sample (MuSiC vs DWLS)")
In this section we use Allen Brain M1 as the reference dataset.
Repackage:
abm1_eset <- Biobase::ExpressionSet(
# Expression data
assayData = as.matrix(abm1_data),
# Metadata
phenoData = Biobase::AnnotatedDataFrame(
data = data.frame(
row.names = rownames(abm1_meta),
cellType = abm1_meta$celltype,
SubjectName = abm1_meta$donor,
check.names = FALSE,
check.rows = FALSE
),
varMetadata = data.frame(
row.names = c("cellType", "SubjectName"),
labelDescription = c("cellType", "SubjectName")
)
)
)
Deconvolution:
bisque_report <-
BisqueRNA::ReferenceBasedDecomposition(
# BULK DATA
bulk.eset = fgcz_eset,
# REFERENCE
sc.eset = abm1_eset,
#
use.overlap = FALSE,
markers = abm1_markergenes,
verbose = FALSE
)
## Warning in BisqueRNA::ReferenceBasedDecomposition(bulk.eset = fgcz_eset, :
## Only two individuals detected in single-cell data. While Bisque will run, we
## recommend at least three subjects for reliable performance.
bisque_report$bulk.props %>%
save_heat_bars_data("fgcz_abm1_bisque") %>%
hush()
# Read from disk if already available
# otherwise it takes ages
if (!file.exists(out_file("fgcz_abm1_music_data.csv"))) {
# MuSiC with ABM1
music_report.abm1 <-
MuSiC::music_prop(
bulk.eset = fgcz_eset,
sc.eset = abm1_eset,
clusters = abm1_meta$celltype,
samples = names(abm1_data)
) %>%
suppressMessages()
fgcz_abm1_music <-
t(music_report.abm1$Est.prop.weighted)
} else {
fgcz_abm1_music <-
from_csv(out_file("fgcz_abm1_music_data.csv"))
}
fgcz_abm1_music %>%
save_heat_bars_data("fgcz_abm1_music") %>%
hush()
(in progress)
(in progress)
R Andreev, Cell type deconvolution from bulk RNA-seq of the human prefrontal cortex, 2021-01-09, http://bit.ly/deco_ra.